ABSTRACT

Nonparametric and parametric methods for the estimation of probability densities are
compared. A new algorithm for nonparametric density estimation is given and its performance
is compared with that of state-of-the-art kernel estimation algorithms.

Key words: computational feasibility, maximum likelihood, Pearson family, kernel estimates,
penalized maximum likelihood.

1.  INTRODUCTION 


Two major causes of poor (especially nonrobust) optimization-theoretic techniques in
statistics are

(1)  an  inappropriate  choice  of  a  parameter  (function)  space 

and 

(2) an inappropriate choice of a criterion function (functional).

"Appropriateness" is determined by a balance between computational feasibility and
approximation to truth. It is to be expected that the advent of the high-speed digital computer
should drastically raise our pain threshold of computational feasibility. Consequently it is
somewhat surprising that most standard statistical procedures have remained unchanged since
the 1930's. Many of these involve the estimation of probability densities.

2.  DISCUSSION 

In 1922 Fisher [1] presented the concept of parametric maximum likelihood estimation.

We recall that his development requires that the functional form of the unknown density
$f(x|\theta)$ be known. Given a random sample $\{x_1, x_2, \ldots, x_n\}$ from $f$, we seek
that value $\hat{\theta}_n$ contained in an appropriate parameter space $\Theta$ which maximizes

$$\prod_{j=1}^{n} f(x_j|\theta)$$

or, equivalently, which maximizes

$$\log \prod_{j=1}^{n} f(x_j|\theta) .$$
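
As a minimal sketch of the maximization just described, assume the density is normal with
parameter (mu, sigma). The choice of family, the simulated sample, and the function names
are illustrative assumptions, not part of Fisher's development. For this family the
maximizer of the log-likelihood is available in closed form, and the code checks that
nearby parameter values give a smaller log-likelihood.

```python
import math
import random


def normal_log_likelihood(sample, mu, sigma):
    """Log-likelihood of an i.i.d. sample under N(mu, sigma^2)."""
    n = len(sample)
    squared_dev = sum((x - mu) ** 2 for x in sample)
    return -0.5 * n * math.log(2.0 * math.pi * sigma ** 2) - squared_dev / (2.0 * sigma ** 2)


def normal_mle(sample):
    """Closed-form maximizer of the normal log-likelihood."""
    n = len(sample)
    mu_hat = sum(sample) / n
    sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in sample) / n)
    return mu_hat, sigma_hat


# Simulated data (assumed for illustration): 500 draws from N(2, 1.5^2).
rng = random.Random(0)
sample = [rng.gauss(2.0, 1.5) for _ in range(500)]

mu_hat, sigma_hat = normal_mle(sample)
best = normal_log_likelihood(sample, mu_hat, sigma_hat)

# Perturbing either parameter away from the estimate lowers the log-likelihood.
shifted = normal_log_likelihood(sample, mu_hat + 0.1, sigma_hat)
rescaled = normal_log_likelihood(sample, mu_hat, 1.1 * sigma_hat)
print(mu_hat, sigma_hat, best > shifted, best > rescaled)
```

For families without a closed-form maximizer the same log-likelihood function would be
handed to a numerical optimizer instead.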


Then  under  very  general  conditions, 
The latter result is particularly appealing, since it states that the parametric maximum
likelihood estimator asymptotically achieves the Cauchy-Schwarz (Cramér-Rao) lower bound
for $E[(\hat{\theta} - \theta)^2]$, where $\hat{\theta}$ ranges over the class of unbiased
estimates for $\theta$.

The optimality properties of parametric maximum likelihood algorithms are likely to be
of little utility if (as is generally the case) we do not have a good idea as to the
functional form of the unknown density. For example, if we assume the density is normal, the
maximum likelihood estimator for the median $\theta$ is $\bar{x}$. If, in fact, the underlying
distribution is Cauchy, $\bar{x}$ is no better an estimator for $\theta$ than any single one of
the observations. In general, if we assume an incorrect functional form of the density and use
any of the classical parametric techniques for estimating the density, we will find that


$$\lim_{n \to \infty} \int_{-\infty}^{\infty} E\left[\left(f_{\mathrm{est},n}(x) - f_{\mathrm{true}}(x)\right)^2\right] dx > 0 . \qquad (4)$$
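
The Cauchy example above can be checked by simulation, since the mean of independent
Cauchy observations has the same Cauchy distribution as a single observation. The sketch
below is an illustration under assumed settings (standard Cauchy, n = 25, 20000
replications, and the interquartile range as the measure of spread, because the Cauchy has
no variance). It compares the spread of the sample mean with that of a single observation.

```python
import math
import random


def cauchy_draw(rng, location=0.0):
    """One draw from a Cauchy distribution via the inverse CDF."""
    return location + math.tan(math.pi * (rng.random() - 0.5))


def interquartile_range(values):
    """Difference between the empirical upper and lower quartiles."""
    ordered = sorted(values)
    n = len(ordered)
    return ordered[(3 * n) // 4] - ordered[n // 4]


rng = random.Random(1)
n, replications = 25, 20000  # assumed settings for this illustration

means, singles = [], []
for _ in range(replications):
    sample = [cauchy_draw(rng) for _ in range(n)]
    means.append(sum(sample) / n)   # the normal-theory estimator of the median
    singles.append(sample[0])       # a single observation from the same sample

iqr_means = interquartile_range(means)
iqr_singles = interquartile_range(singles)
print(iqr_means, iqr_singles)
```

If the two printed spreads are close, averaging 25 observations has bought essentially
nothing over looking at one of them.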


The pathology of parametric maximum likelihood estimation under real world conditions
should not be unexpected. An optimization-theoretic technique designed to have good
performance under very restrictive conditions (e.g., that the functional form of the density
is known) is unlikely to perform well when we step outside the domain of these conditions.
We need to devise algorithms which are "optimal" in a more general and realistic setting.
This point was implicitly raised a quarter century before maximum likelihood by Karl
Pearson [7]. (For a discussion of the Fisher-Pearson battle on maximum likelihood, the
reader is referred to [13].) He considered a fairly large class of probability densities
characterized by the differential equation


$$\frac{d \log f(x)}{dx} = \frac{x - a}{b_0 + b_1 x + b_2 x^2} . \qquad (5)$$


The  estimation  of  the  four  parameters  is  readily  carried  out  via  the  first  four  sample 
moments.  Unfortunately,  although  the  Pearson  Family  contains  many  of  the  classical 
distributions,  it  has  serious  deficiencies.  For  example,  it  contains  no  multimodal  densities. 
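
As a quick check that a classical distribution fits equation (5) as written, consider the
normal density with mean $\mu$ and variance $\sigma^2$ (the symbols $\mu$ and $\sigma$ are
introduced here only for illustration):

```latex
\log f(x) = -\frac{(x - \mu)^2}{2\sigma^2} - \log\left(\sigma\sqrt{2\pi}\right)
\quad\Longrightarrow\quad
\frac{d \log f(x)}{dx} = -\frac{x - \mu}{\sigma^2} = \frac{x - a}{b_0 + b_1 x + b_2 x^2}
\quad\text{with}\quad a = \mu,\; b_0 = -\sigma^2,\; b_1 = b_2 = 0 .
```

The right-hand side of (5) changes sign only at $x = a$ wherever the denominator keeps one
sign, which is consistent with the remark that the family contains no multimodal densities.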

In order to obtain a practical extension of Pearson's concept to density estimation in
the general setting where we know only that the underlying density is "smooth", we must
develop an estimator where the number of characterizing parameters increases with the sample
size. The simple histogram (dating back to John Graunt in 1662 [3]) has such a property
but suffers from discontinuities. These may be eliminated quite readily by connecting
midpoints with straight lines. The extreme "locality" of the histogram is less easily
ameliorated.
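
The histogram and the device of connecting midpoints with straight lines can be sketched
as follows. The bin range, the number of bins, the simulated sample, and the function names
are assumptions made for illustration; the polygon assumes one empty bin beyond each end
of the range so that it returns to zero.

```python
import math
import random


def histogram_density(sample, low, high, bins):
    """Histogram heights on [low, high), scaled by 1 / (n * bin width)."""
    width = (high - low) / bins
    counts = [0] * bins
    for x in sample:
        if low <= x < high:
            # min() guards against floating-point rounding at the top edge.
            index = min(int((x - low) / width), bins - 1)
            counts[index] += 1
    n = len(sample)
    return [c / (n * width) for c in counts]


def frequency_polygon(heights, low, high, x):
    """Join bin midpoints with straight lines; an empty bin is assumed past each end."""
    bins = len(heights)
    width = (high - low) / bins
    padded = [0.0] + list(heights) + [0.0]
    t = (x - low) / width - 0.5  # position in bin widths, measured from the first midpoint
    i = math.floor(t)
    if i < -1 or i > bins - 1:
        return 0.0
    frac = t - i
    return (1.0 - frac) * padded[i + 1] + frac * padded[i + 2]


rng = random.Random(2)
sample = [rng.gauss(0.0, 1.0) for _ in range(1000)]
low, high, bins = -5.0, 5.0, 20
width = (high - low) / bins

heights = histogram_density(sample, low, high, bins)
area = sum(heights) * width                                # total mass under the histogram
value_at_zero = frequency_polygon(heights, low, high, 0.0)  # x = 0 is a bin edge here
print(area, value_at_zero)
```

The polygon is continuous, but each value still depends only on the counts in two
neighboring bins, which is the "locality" the text refers to.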

Computationally more complicated but possessing better consistency properties than the
histogram is the kernel density estimator (or "shifted histogram" [12], [6], [8]). Here, on
the basis of a random sample $\{x_1, x_2, \ldots, x_n\}$ we have the estimator

$$\hat{f}_n(x) = \frac{1}{nh} \sum_{j=1}^{n} K\left(\frac{x - x_j}{h}\right)$$

where $K$ is any probability density having

$$\int_{-\infty}^{\infty} |K(y)| \, dy < \infty$$

$$\lim_{|y| \to \infty} |y K(y)| = 0 .$$
To minimize the asymptotic integrated mean square error, we have the optimal

$$h = \left[ \frac{9}{2 \int (f''(x))^2 \, dx} \right]^{1/5} n^{-1/5}$$

which gives as asymptotic integrated mean square error

$$\mathrm{IMSE} = 2^{-14/5} \cdot 5 \cdot 9^{-1/5} \left[ \int (f''(x))^2 \, dx \right]^{1/5} n^{-4/5} . \qquad (11)$$

Unfortunately, the design parameter $h$ requires approximate knowledge of $\int (f''(x))^2 \, dx$.

An iterative algorithm for the estimation of $h$ is given in [12]. Monte Carlo results
indicate that a twofold overestimation or underestimation of $h$ typically causes a
twofold increase of the IMSE over that shown in (11). A survey of other nonparametric
density estimation techniques is given in [13].
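
The kernel estimator and its dependence on the design parameter h can be sketched as
follows. This is an illustration under stated assumptions: a Gaussian kernel, a simulated
N(0,1) sample of size 200, and a reference value h_ref = 0.4 that is simply assumed, not
computed as an optimum. The error reported is the integrated squared error for this one
sample, approximated on a grid, not the IMSE, which would require averaging over many
samples.

```python
import math
import random


def gaussian_kernel(y):
    """Standard normal density, one admissible choice of K."""
    return math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)


def kernel_density(sample, h, x, kernel=gaussian_kernel):
    """Kernel estimate (1 / (n h)) * sum over j of K((x - x_j) / h)."""
    n = len(sample)
    return sum(kernel((x - xj) / h) for xj in sample) / (n * h)


def integrated_squared_error(sample, h, true_density, grid):
    """Riemann-sum approximation of the integral of (estimate - truth)^2 on an even grid."""
    step = grid[1] - grid[0]
    return step * sum((kernel_density(sample, h, x) - true_density(x)) ** 2 for x in grid)


rng = random.Random(3)
sample = [rng.gauss(0.0, 1.0) for _ in range(200)]
grid = [-8.0 + 0.05 * i for i in range(321)]

h_ref = 0.4  # assumed reference bandwidth for this illustration, not an optimal value
for h in (0.5 * h_ref, h_ref, 2.0 * h_ref):
    # The true density is N(0,1), which is the same function as the Gaussian kernel.
    print(h, integrated_squared_error(sample, h, gaussian_kernel, grid))

# The estimate is itself a density: its mass on the grid should be close to one.
total_mass = 0.05 * sum(kernel_density(sample, h_ref, x) for x in grid)
print(total_mass)
```

Repeating the loop over many simulated samples and averaging would give a Monte Carlo
estimate of the IMSE at each h.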

A new approach motivated by a suggestion of Good [2] has been considered in [4], [5],
[11], [13]. Here we seek that density $f \in H^s(a,b)$ which maximizes the criterion functional

$$f^{(k)} \in L^2(a,b); \quad k = 0, 1, \ldots, s$$

$$f^{(k)}(a) = f^{(k)}(b) = 0; \quad k = 0, 1, 2, \ldots, s-1$$

$$f \ge 0$$

$$\int_a^b f(x) \, dx = 1 .$$


The solution to (12) is referred to as the maximum penalized likelihood estimator. From [5]
we have

Theorem.  The  MPLE  estimator  exists  and  is  unique.  ■ 


Recently, a discretized approximation to the solution of (12) has been algorithmized
and investigated by Scott [10], [11]. This work suggests

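Theorem. If $\hat{f}_n(\cdot)$ is the solution to the MPLE criterion and $f_T \in H^s(a,b)$, then

$$\int_a^b E\left[\left(\hat{f}_n(x) - f_T(x)\right)^2\right] dx \to 0 \quad \text{as } n \to \infty , \qquad (13)$$

where $f_T(\cdot)$ is the density $f$ truncated to $(a,b)$.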


From a practical standpoint, the performance of $\hat{f}_n(\cdot)$ is relatively insensitive to
the selection of the design parameters $\alpha_k$. If we set all the $\alpha_k = 0$ except for
one, it is not unusual for a change of that $\alpha_k$ by a factor of 100 from the optimal to
increase the IMSE by less than a factor of 2.


In Table I, we compare the IMSE of the MPLE with that of the popular Gaussian kernel estimator
for various densities and sample sizes. Of special note is the fact that although we have
used the optimal (and unobtainable) design parameter for the kernel estimator, we have used
the suboptimal value of 10 for the nonzero $\alpha_k$ throughout for the MPLE.
TABLE I

IMSE Values of the MPLE (nonzero $\alpha_k = 10$) and Gaussian Kernel Density Estimation
(with optimal $h$) for Various Distributions and Sample Sizes.

Density                                  n      MPLE IMSE    Kernel IMSE

N(0,1)                                   25     .0027        .0041
                                         100    .00079       .00129
                                         400    .00033       .00053

(1/2)N(-1.5,1) + (1/2)N(1.5,1)           25     .00159       .00128
                                         100    .00054       .00052

t_5                                      25     .00282       .00475
                                         100    .00084       .00157


3.  CONCLUSIONS 


The supposed optimality of classical parametric density estimation procedures is
frequently invalid because the true functional form of the density is unknown.
Nevertheless, we can attack the more general and practical problem of estimating a density
of unknown functional form. The maximum penalized likelihood density estimator has been
algorithmized and is now a part of standard statistical software [11].